Stroke was a rare outcome (~5%), creating strong class imbalance and making detection of positive cases difficult.
Key predictors across models were age, hypertension, heart disease, average glucose, and smoking — consistent with established clinical risk factors.
Logistic Regression showed good discrimination (AUC ≈ 0.78) and remains a strong interpretable baseline model.
Despite good AUC, sensitivity at the default 0.5 threshold was extremely low due to class imbalance.
Lowering the threshold to 0.2 raised sensitivity to about 22% while keeping specificity high (~95%), a more clinically reasonable trade-off.
Tree-based ensembles (RF, GBM) achieved slightly higher AUC but did not dramatically improve sensitivity and were less interpretable.
Accuracy and specificity were high for all models but misleading, because they mainly reflected the dominance of the non-stroke class.
Results show that routine health indicators can meaningfully separate higher- vs lower-risk individuals, but handling class imbalance is critical.
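
The threshold and accuracy points above can be illustrated with a small sketch. It assumes made-up toy labels and predicted probabilities (2 positives in 40 cases, about 5%), not the study's data, and `confusion_metrics` is a hypothetical helper name; the figures it produces are not the reported results.

```python
def confusion_metrics(y_true, y_prob, threshold):
    """Return sensitivity, specificity and accuracy at a probability threshold."""
    tp = fp = tn = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1 and predicted == 0:
            fn += 1
        elif label == 0 and predicted == 1:
            fp += 1
        else:
            tn += 1
    return {
        # Share of true positives that were flagged.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Share of true negatives that were left unflagged.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "accuracy": (tp + tn) / len(y_true),
    }


# Toy data (an assumption for illustration): 2 stroke cases, 38 non-stroke cases.
y_true = [1, 1] + [0] * 38
# Hypothetical predicted probabilities; none reaches the default 0.5 cut-off.
y_prob = [0.35, 0.15] + [0.25, 0.22] + [0.05] * 36

at_default = confusion_metrics(y_true, y_prob, threshold=0.5)
at_lowered = confusion_metrics(y_true, y_prob, threshold=0.2)

for name, metrics in (("0.5", at_default), ("0.2", at_lowered)):
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

On this toy data the 0.5 threshold flags no one, so accuracy is 95% while sensitivity is 0; at 0.2 one of the two positives is caught at the cost of two false positives. This is the same pattern as in the findings: high accuracy mostly reflects the majority class, and the threshold controls the sensitivity/specificity trade-off.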